{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0c031327-2456-41a2-b0ef-975bf96823c7",
   "metadata": {},
   "source": [
    "## How to reindex a collection\n",
    "This notebook will walk through the process of ingesting a dataset and loading a collection in Milvus. After the collection is ingested, we will show how a user, can reindex a collection using the `reindex_collection` function. With this the user is able to grab all data from the collection itself, in the case that the user does not have access to the original corpus. When reindexing a collection, all collection related metadata will be conserved with each element. This function, pulls all the data the identified collection and \n",
    "\n",
    "First step is to annotate all the necessary variables to ensure our client connects to our pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a902bd2d-cf8e-4b68-8a98-a5b535e440d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name=\"nvidia/llama-3.2-nv-embedqa-1b-v2\"\n",
    "hostname=\"localhost\"\n",
    "collection_name = \"nv_ingest_collection\"\n",
    "sparse = True"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "157d8909-542b-47fd-b01c-6689eefdaf11",
   "metadata": {},
   "source": [
    "Next step, instantiate your ingestor object with all the stages you want in your pipeline. Ensure that you have a vdb_upload stage, as this is what will load your transformed elements(data) in to the vector database. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6f9e2a4-7e50-491d-a0c6-21a4d4f27db9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nv_ingest_client.client import Ingestor\n",
    "\n",
    "ingestor = ( \n",
    "    Ingestor(message_client_hostname=hostname)\n",
    "    .files([\"data/woods_frost.pdf\", \"data/multimodal_test.pdf\"])\n",
    "    .extract(\n",
    "        extract_text=True,\n",
    "        extract_tables=True,\n",
    "        extract_charts=True,\n",
    "        extract_images=True,\n",
    "        text_depth=\"page\"\n",
    "    ).embed(text=True, tables=True\n",
    "    ).vdb_upload(\n",
    "        collection_name=collection_name, \n",
    "        milvus_uri=f\"http://{hostname}:19530\", \n",
    "        sparse=sparse, \n",
    "        minio_endpoint=f\"{hostname}:9000\", \n",
    "        dense_dim=2048\n",
    "    )\n",
    ")\n",
    "results = ingestor.ingest_async().result()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef0a8aeb",
   "metadata": {},
   "source": [
    "Once you have completed the normal ingestion, the collection will have been loaded into your Vector Database. If you need to reindex that data for whatever reason, you can simply run the `reindex_collection` function and supply the necessary parameters. There is a full list of parameters in the docstring of the function, with many defaults already set for you. This function is desigend to be used when the results from your ingestor pipeline are no longer available. You might have ingested this information at a previous date/time and the ingestor results are no longer in memory. This function allows you to query the data from the vector database to recreate those results and send them into a new collection or the same collection, effectively replacing the previous information stored in that collection. \n",
    "\n",
    "In this example we will reindex under the same collection name, replacing the data in the collection. You can always supply a `new_collection_name` as one of the arguments to the function allowing you to save the reindex in another collection. The function supplies a `write_dir` parameter which allows you to pull the data from the collection and write it into files in batches, relieving memory pressure. Currently the batch_size is automatically set to the default query batch_size for the vector database. The `write_dir` option is meant to be used when the data is larger than the available resources, with this option reindexing is slower than when holding the data in host memory.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1396906e-321a-4ab6-af83-9e651a51cb7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nv_ingest_client.util.milvus import reindex_collection\n",
    "\n",
    "reindex_collection(\n",
    "    collection_name=collection_name, \n",
    "    sparse=sparse\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e050ecd6-714b-4297-b90f-528dc15b4f08",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
